✨ Add dataset processors #184

Merged
merged 38 commits into from
Nov 14, 2024

arxyzan
Member

@arxyzan arxyzan commented Nov 14, 2024

Pull Request

Description

This PR adds dataset processors: a group of callable classes that can be used as map functions for regular 🤗 Datasets. Hezar's Trainer could already accept 🤗 Datasets; the new classes are a general set of processors that reproduce, on a plain 🤗 Dataset, the same results that Hezar Dataset subclasses would produce.
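
To illustrate the idea of a callable class used as a map function, here is a minimal sketch. It is not Hezar's actual implementation: the class name `ToyTextProcessor`, its parameters, the whitespace tokenization, and the `text` column name are all assumptions made for this example. It only relies on the batch format that a batched map function receives, a dict mapping column names to lists of values.

```python
from typing import Dict, List


class ToyTextProcessor:
    """Hypothetical callable processor (illustration only, not part of Hezar's API)."""

    def __init__(self, vocab: Dict[str, int], max_length: int = 8,
                 unk_id: int = 1, pad_id: int = 0):
        self.vocab = vocab
        self.max_length = max_length
        self.unk_id = unk_id
        self.pad_id = pad_id

    def __call__(self, batch: Dict[str, List]) -> Dict[str, List]:
        # A batched map function receives a dict of column name -> list of values.
        input_ids, attention_mask = [], []
        for text in batch["text"]:
            # Toy whitespace tokenization; a real processor would call a tokenizer.
            ids = [self.vocab.get(tok, self.unk_id) for tok in text.lower().split()]
            ids = ids[: self.max_length]  # truncate to max_length
            padding = self.max_length - len(ids)
            attention_mask.append([1] * len(ids) + [0] * padding)
            input_ids.append(ids + [self.pad_id] * padding)
        # The returned dict holds the new columns produced for this batch.
        return {"input_ids": input_ids, "attention_mask": attention_mask}


processor = ToyTextProcessor(vocab={"hello": 2, "world": 3}, max_length=4)
batch = {"text": ["Hello world", "hello there big wide world"], "label": [0, 1]}
output = processor(batch)
```

With the 🤗 Datasets library installed, an instance like this could be passed to a dataset as `dataset.map(processor, batched=True)`, since any callable that takes and returns such a dict can serve as the map function.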

Changes

  • Add dataset processor classes
  • Add docs for dataset processors
  • Add other minor changes and fixes
  • Add minor fixes to the tokenizer
  • Remove some args from the tokenizer config

Related Issues

Checklist

  • I have read and followed the project's contributing guidelines.
  • My code follows the project's coding style.
  • I have tested my changes thoroughly.
  • I have updated the documentation if necessary.
  • All existing tests pass.
  • I have added new tests to cover my changes.
  • My changes do not introduce any new warnings or errors.

Additional Comments

Reviewer Instructions

Author's Note

…tasets-map-processing

# Conflicts:
#	hezar/data/dataset_processors/sequence_labeling_processor.py
…and `ImageProcessor`

Only applies when `return_tensors="list"`
@arxyzan arxyzan changed the title from "Datasets map processing" to "✨ Add dataset processors" Nov 14, 2024
@arxyzan arxyzan merged commit fa54504 into main Nov 14, 2024
1 check passed